ABSTRACT

This study extended item parameter recovery studies in item response theory to the nominal 
response model (NRM). The NRM may be used with computerized adaptive testing, testlets, 
demographic items, and items whose alternatives provide educational diagnostic information. 
Moreover, with the increasing popularity of performance-based assessment, the use of polytomous 
item response theory models, in general, and the NRM in particular, will more than likely see 
increased application. Establishing guidelines for reasonable item parameter estimation was seen 
as fundamental to the use of the NRM. Factors studied were the sample size ratio, the latent 
ability distribution, and item information level. Results showed that as the latent ability 
distribution departs from a uniform distribution the accuracy of estimating the slope parameter 
decreased. This decrease in accuracy may be compensated for, in part, by increasing the sample 
size. Moreover, more informative items tended not to be as well estimated as less informative 
items. The results appear to indicate that if one is interested in estimating ability, a sample size 
ratio of 5 : 1 can produce reasonably accurate item parameter estimates for this purpose. 

Item response theory (IRT) has emerged as a popular approach for solving various 
measurement problems. IRT is used in state testing programs such as the Maryland State 
Department of Education’s High School Functional Assessment program as well as in municipal 
programs, such as the Portland School district. Both of these programs use IRT for test equating 
and the Portland program also uses IRT for test design (Ferrara, personal communication, October 
4, 1991; Kingsbury, personal communication, Nov. 19, 1991; Forster, 1987). The nationally 
available California Achievement Test and the California Test of Basic Skills (Fourth Edition) are 
designed and equated using IRT (CTB/McGraw-Hill, 1987; CTB/MacMillan/McGraw-Hill, 1991). 
Moreover, certification boards such as the American Society of Clinical Pathologists have an 
IRT-based adaptive testing program for certification (Bergstrom & Lunz, 1991). 

Most IRT work has been based on binary models such as the one- and three-parameter logistic 
models. With these models an individual’s response is categorized as either correct or incorrect. 
However, not all examinee-item interactions may be appropriately modeled by binary models. For 
instance, to capture the information in a Likert item or to assign credit for a partially correct 
answer requires a model that contains more than two categories. Moreover, because the 
distributions of wrong answers over the options of multiple-choice items differ across ability 
levels (Nedelsky, 1954; Levine & Drasgow, 1983), it is possible and may be desirable to use a 
model that can assess information from all item options rather than to use a model which assumes 
an examinee either knows the correct answer or randomly selects an incorrect alternative. In 
addition, the one- and three-parameter logistic models do not incorporate findings from human 
cognition studies (e.g., Brown & Burton, 1978; Brown & VanLehn, 1980; Lane, Stone, & Hsu, 1990; 
Tatsuoka, 1983). For instance, Tatsuoka's (1983) analysis of student misconceptions in 
performing mathematics problems showed that wrong responses could be of more than just one 
kind. However, dichotomous scoring uniformly assigned a score of zero to all the wrong 
responses. In this regard, an item's incorrect alternatives may augment our estimate of an 
examinee’s ability by providing information about the examinee's level of understanding (i.e., 
provide diagnostic information). 

In contrast to binary models, polytomous models contain more item parameters to estimate. 
Because of these additional parameters, potentially larger sample sizes may be required for their 
accurate estimation. For example, for Masters' (1982) partial credit model (PCM) a ratio of 
examinees to item parameters of 2 : 1 or larger was needed to produce stable item and ability 
parameter estimates, regardless of the number of categories (Walker-Bartnick, 1990). For 
Samejima's (1969) graded response model (GRM), Reise and Yu (1990) recommended that at least 500 
examinees are needed to achieve an adequate calibration. Their study was conducted with a 
25-item test and therefore their guidelines may only be appropriate for tests of this length. 
(With longer tests it may be necessary to increase the sample size.) Similar findings were 
reported by Ankenmann and Stone (1992). 

One polytomous model for which item parameter recovery has not been studied is Bock's 
(1972) nominal response model (NRM). The NRM is appropriate for items with unordered 
responses. The NRM may be used in computerized adaptive testing (De Ayala, 1992), with testlets 
(Wainer & Kiely, 1987) to solve various testing issues, such as multidimensionality (Thissen, 
Steinberg, & Mooney, 1989), with items that do not have a "correct" response, such as 
demographic items (e.g., to provide ancillary information), and with items whose alternatives 
provide educational diagnostic information. Moreover, with the increasing popularity of 
performance-based assessment, the use of polytomous IRT models, in general, and the NRM in 
particular, will more than likely see increased application. 

The objective of this study was to establish guidelines for obtaining reasonably accurate item 
parameter estimates for the NRM. Because it was believed that the ratio of the sample size to 
number of parameters to be estimated is more useful than the actual sample size used, one factor 
studied was the sample size ratio (SSR). For instance, simply because the use of 100 examinees 
allows accurate Rasch parameter estimation with a 20 item test does not necessarily imply that 
only 100 examinees are required to obtain good estimates with a 100 item test. In this study, 
three ratios of observations to number of item parameters to be estimated were investigated: 
5 : 1, 10 : 1, and 20 : 1. 

Previous parameter recovery studies (e.g., Ankenmann & Stone, 1992; Reise & Yu, 1990) have 
varied the discrimination parameter. For example, Reise and Yu classified item discrimination 
into three ranges: high, medium, and low. However, because with the NRM there are multiple 
discrimination parameters for each item, such a scheme did not appear to be useful. Further 
complicating the issue is the fact that when the number of categories is three or more, different 
combinations of an item's slopes and intercepts can produce the same maximum amount of 
information (Imax) value. Therefore, establishing guidelines in terms of the magnitude of the 
slope vectors was not pursued. Rather, in order to establish a design with the characteristic of 
"high", "medium", and "low" discrimination, it was noted that the primary importance of the 
discrimination parameter is its effect on item information. Therefore, one may reconceptualize 
the Reise and Yu study as using items that are "high", "medium", and "low" in information rather 
than in terms of discrimination parameters. As such, values for Imax were set a priori and a 
slope vector to obtain a specific Imax was determined. The Imax values studied were 0.25, 0.16, 
and 0.09; for dichotomous models these Imax values are equivalent to items with discriminations 
of 1.0, 0.8, and 0.6, respectively. 
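
The stated equivalence for dichotomous items can be checked with a short sketch. It assumes the 
two-parameter logistic form without the 1.7 scaling constant, for which item information is 
a²P(1 - P) and peaks at a²/4 when P = 0.5; the function name is illustrative and not part of the 
study. 

```python
def max_info_2pl(a: float) -> float:
    """Maximum item information of a logistic item with slope a.

    Assumes no 1.7 scaling constant: information is a**2 * P * (1 - P),
    which is largest at P = 0.5.
    """
    return a ** 2 * 0.25


# Discriminations quoted in the text and the Imax values they imply.
for a in (1.0, 0.8, 0.6):
    print(a, round(max_info_2pl(a), 2))
```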

Because the accuracy of estimating items located at various points along the ability (θ) scale 
may be affected by the latent θ distribution (LD), a third factor investigated was the effect of 
the LD. Three distributions, normal, positively skewed, and uniform, were studied. An additional 
factor used in the study was whether the item consisted of three or four options. 

Model 

The NRM assumes that item alternatives represent responses which are unordered. The NRM 
provides a direct expression for obtaining the probability of an examinee with ability θ 
responding in the j-th category of item i as: 

$$P_{ij}(\theta) = \frac{\exp(c_{ij} + a_{ij}\theta)}{\sum_{h=1}^{m_i} \exp(c_{ih} + a_{ih}\theta)} = \frac{\exp(a_{ij}(\theta - b_{ij}))}{\sum_{h=1}^{m_i} \exp(a_{ih}(\theta - b_{ih}))} \tag{1}$$

where $a_{ij}$ and $c_{ij}$ are the slope and intercept parameters, respectively, of the nonlinear 
response function associated with the j-th category of item i, and $m_i$ is the number of 
categories of item i (i.e., j = 1, 2, ..., $m_i$). For convenience the slope and intercept 
parameters are sometimes represented in vector notation, where a = ($a_{i1}$, $a_{i2}$, ..., 
$a_{im}$) and c = ($c_{i1}$, $c_{i2}$, ..., $c_{im}$). The $a_{ij}$s are analogous to and have an 
interpretation similar to traditional option discrimination indices. That is, a crosstabulation of 
ability groups by item alternatives shows that a category with a large $a_{ij}$ reflects a 
response pattern in which, as one progresses from the lower ability groups to the higher ability 
groups, there is a corresponding increase in the number of persons who answered the item in that 
category; for categories with negative $a_{ij}$s this pattern is reversed. The intercept 
parameters reflect the interaction between a category's difficulty and how well it discriminates. 
It appears that, in general, large values of $c_{ij}$ are associated with categories with large 
frequencies, and as the value of $c_{ij}$ becomes increasingly smaller the frequencies for the 
corresponding categories decrease. 

The probability of responding in a particular category as a function of θ may be depicted by 
the option response function (ORF); other synonymous terms are category or option 
characteristic curve and trace line. Figure 1 contains the ORFs for a three-category (m = 3) 
item with a = (-0.75, -0.25, 1.0) and c = (-1.5, -0.25, 1.75). 
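
The ORFs for this example item can be tabulated with a minimal sketch of the slope-intercept 
form of the model. The function name `nrm_probs` and the grid of θ values are illustrative 
choices, not part of the study's software. 

```python
import numpy as np


def nrm_probs(theta, a, c):
    """Option response functions of the nominal response model.

    theta : scalar or 1-D array of abilities
    a, c  : slope and intercept vectors, one entry per category
    Returns an array of shape (len(theta), m); each row sums to 1.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    z = theta[:, None] * a[None, :] + c[None, :]   # multivariate logits
    z -= z.max(axis=1, keepdims=True)              # guard against overflow
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)


# Three-category item from the text.
a = [-0.75, -0.25, 1.0]
c = [-1.5, -0.25, 1.75]
grid = np.linspace(-4.0, 4.0, 9)
orf = nrm_probs(grid, a, c)
for t, row in zip(grid, orf):
    print(f"theta={t:+.1f}", np.round(row, 3))
```

In this tabulation the first category is the most likely response at the low end of the grid and 
the third category at the high end, as expected from the signs of the slopes. 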

Insert Figure 1 about here 



The intersection of the ORFs can be obtained by setting the multivariate logits of adjacent 
categories equal to one another and solving for θ. In general, for any item with $m_i > 2$ and 
because θ and b are on the same scale: 

$$\theta = b = \frac{c_{i,k-1} - c_{ik}}{a_{ik} - a_{i,k-1}} \tag{2}$$

where $k = 2, \ldots, m_i$ and there are $m_i - 1$ ORF intersection points. This formulation is 
analogous to the step difficulties in the PC model. 

METHOD 

Programs: MULTILOG (Thissen, 1988) was used to obtain item parameter estimates for the NRM 
using default program parameters. A data generation program for generating responses according 
to the NRM was also written. 

Data: A series of data sets was created. Each data set consisted of responses to 28 items, and 
the data sets differed from one another on the basis of Imax, number of item options, the form of 
the ability distribution from which the simulees were sampled, and the SSR. The 28 item set was 
created by determining for a given Imax level the c vector needed to place the item's location 
(the average of the $b_j$s) at one of the seven scale points between -3.0 and 3.0 in increments 
of 1 logit. For example, for a four-option item for the 0.25 Imax condition a = (0.450, -0.150, 
-0.100, -0.200) and to locate this item at -3.0 one would use c = (0.926, -0.275, -0.125, 
-0.525). (That is, the item's location = ($b_1 + b_2 + b_3$)/3 with $b_1$ = -2.00, $b_2$ = 
-3.000, and $b_3$ = -4.000, and the $b_j$s are always one logit apart.) In this fashion seven 
items were created that spanned the usual θ range used in IRT and these items were replicated to 
produce the 28 item set. 
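
The location of the example item can be verified with a brief sketch. It assumes the 
intersection of two adjacent ORFs is found by equating their logits, which gives 
θ = ($c_{k-1} - c_k$)/($a_k - a_{k-1}$); the helper name `orf_intersections` is hypothetical. 

```python
def orf_intersections(a, c):
    """Abilities at which adjacent ORFs cross.

    Equating c[k-1] + a[k-1]*theta with c[k] + a[k]*theta and solving
    for theta gives (c[k-1] - c[k]) / (a[k] - a[k-1]).
    """
    return [(c[k - 1] - c[k]) / (a[k] - a[k - 1]) for k in range(1, len(a))]


# Four-option item from the text (0.25 Imax condition, located at -3.0).
a = [0.450, -0.150, -0.100, -0.200]
c = [0.926, -0.275, -0.125, -0.525]
b = orf_intersections(a, c)
location = sum(b) / len(b)
print([round(x, 2) for x in b], round(location, 2))
```

Up to rounding in the published c vector, the three intersections fall one logit apart and 
average to the intended location. 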

For the three-option set of items the number of parameters to be estimated was 168 ((3 
$a_{ij}$s + 3 $c_{ij}$s) × 28 items) and for the four-option item set there were 224 item 
parameters to estimate. With SSRs of 5 : 1, 10 : 1, and 20 : 1 this produced, for the 
three-option items, sample sizes of 840, 1680, and 3360, respectively, and for the four-option 
items samples of 1120, 2240, and 4480, respectively, were needed. For a given LD condition, the 
appropriate number of zs was sampled from a normal (0,1) distribution, a beta distribution 
($df_1$ = 1.25, $df_2$ = 10), or a uniform distribution [-4, 4]. These zs were considered to be 
the simulees' true θs, and the θs plus the parameters of the 28 items were used to generate 
polytomous response strings with a random error component for each simulated examinee. 
Generation of an examinee's polytomous response string was accomplished by calculating the 
probability of responding to each alternative of an item according to the NRM. Based on the 
probability for each alternative, cumulative probabilities were obtained for each alternative. 
A random error component was incorporated into each response by selecting a random number from 
a uniform distribution [0,1] and comparing it to the cumulative probabilities. The ordinal 
position of the first cumulative probability which was greater than the random number was taken 
as the examinee's response to the item. 
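
The sample size arithmetic and the response generation steps described above can be sketched 
as follows. This is an illustration under stated assumptions, not the study's generation 
program: the function names, the random seed, and the repeated three-category item are 
hypothetical, and the beta draws are used as drawn because the text does not say whether they 
were rescaled. 

```python
import numpy as np


def sample_size(n_items, n_options, ssr):
    """Examinees implied by an SSR: ratio times the number of item parameters."""
    n_parameters = 2 * n_options * n_items  # one slope and one intercept per option
    return ssr * n_parameters


def sample_theta(ld, n, rng):
    """Draw n abilities for one latent distribution (LD) condition."""
    if ld == "normal":
        return rng.standard_normal(n)
    if ld == "skewed":
        return rng.beta(1.25, 10.0, n)  # assumption: no rescaling applied
    if ld == "uniform":
        return rng.uniform(-4.0, 4.0, n)
    raise ValueError(f"unknown LD: {ld}")


def generate_responses(thetas, items, rng):
    """Return 1-based option positions, shape (n_examinees, n_items)."""
    out = np.empty((len(thetas), len(items)), dtype=int)
    for j, (a, c) in enumerate(items):
        z = np.outer(thetas, a) + np.asarray(c, dtype=float)
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        cum = np.cumsum(p, axis=1)
        cum[:, -1] = 1.0                       # guard against round-off
        u = rng.random(len(thetas))            # uniform on [0, 1)
        # Position of the first cumulative probability greater than u.
        out[:, j] = (cum > u[:, None]).argmax(axis=1) + 1
    return out


rng = np.random.default_rng(0)
n = sample_size(28, 3, 5)
thetas = sample_theta("normal", n, rng)
items = [([-0.75, -0.25, 1.0], [-1.5, -0.25, 1.75])] * 28  # illustrative item set
responses = generate_responses(thetas, items, rng)
print(n, responses.shape)
```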

For each of the (3 SSRs × 3 LDs × 3 Imax levels × 2 $m_i$s =) 54 conditions, twenty-five 
replications were performed. That is, for a given condition (e.g., one Imax level, normal θ 
distribution, 20 : 1 SSR, 4-option items), twenty-five unique response data sets were generated 
and each was calibrated using MULTILOG. This produced twenty-five sets of item parameter 
estimates for each set of item parameters. For a given combination of the LD and SSR factors, 
the same examinees were used for each of the Imax factor levels (i.e., Imax was a repeated 
measures factor). 

Equating: Because of the indeterminacy of the ability scale, calibration programs define the 
scale so that the mean and standard deviation of θ (or b) are 0 and 1, respectively, for the 
calibration group. Therefore, the use of scale dependent accuracy measures, such as RMSE and 
average absolute deviation, requires that the item parameter estimates be placed on the 
parameter scale. The relationship between the item parameter estimate metric and the item 
parameter metric is a linear one. The basic transformation is: 

$$\theta' = \alpha\theta + \kappa \tag{3}$$

$$a' = \frac{a}{\alpha} \tag{4}$$

$$b' = \alpha b + \kappa \tag{5}$$

where θ', a', and b' are the transformed parameters corresponding to θ, a, and b, and α and κ 
are the slope and intercept equating constants, respectively. In the context of the present 
discussion θ', a', and b' are on the parameter (target) metric, whereas θ, a, and b are on the 
estimate (base) metric. 

The determination of α and κ may be accomplished in a number of ways. For instance, 
Stocking and Lord (1983) have developed a procedure for obtaining the equating constants based 
on test characteristic curves (TCCs); this procedure has been implemented in the EQUATE 2.0 
program (Baker, 1993a) for the binary models, the GRM, and the NRM (Baker, 1992; Baker, 1993b; 
Baker & Al-Karni, 1991). An alternative method using the mean difficulty and the mean 
discrimination for obtaining α and κ was presented by Loyd and Hoover (1980). 

Because the Loyd and Hoover (LH) method is more parsimonious than the Stocking and Lord 
approach, as well as for other pragmatic reasons¹, the LH method was generalized to the nominal 
response model and used for equating the NR item parameter estimates with the item parameters. 
The LH method specifies that: 

$$\alpha = \frac{\bar{a}}{\bar{a}'} \tag{6}$$

$$\kappa = \bar{b}' - \alpha\bar{b} \tag{7}$$

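Given that the slope-intercept form of the NRM multivariate logit for item i category j may be 
reparameterized as: 

$$c_{ij} + a_{ij}\theta = a_{ij}(\theta - b_{ij})$$

and because $c_{ij} = -a_{ij}b_{ij}$, a location for category j can be expressed in terms of the 
slopes and intercepts across items. Therefore, sums are taken across the common items and by 
substitution, as well as by noting that $\bar{b}_j = -\bar{c}_j/\bar{a}_j$, one obtains: 

$$\alpha_j = \frac{\bar{a}_j}{\bar{a}'_j} \tag{8}$$

$$\kappa_j = \bar{b}'_j - \alpha_j\bar{b}_j = \bar{b}'_j - \frac{\bar{a}_j}{\bar{a}'_j}\bar{b}_j = -\frac{\bar{c}'_j}{\bar{a}'_j} - \frac{\bar{a}_j}{\bar{a}'_j}\bar{b}_j = \frac{\bar{c}_j - \bar{c}'_j}{\bar{a}'_j} \tag{9}$$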

Equations (8) and (9) are the EQ-NR method. The equating constants may then be applied to 
transform one metric to another: 

$$a^{*}_{ij} = \frac{a_{ij}}{\alpha_j} \tag{10}$$

$$c^{*}_{ij} = c_{ij} - a^{*}_{ij}\kappa_j \tag{11}$$

where $a^{*}_{ij}$ and $c^{*}_{ij}$ are the equated (transformed) slope and intercept parameters, 
respectively, and $a_{ij}$ and $c_{ij}$ are the untransformed slope and intercept parameters, 
respectively. 

Table 1 contains an example of the application of the EQ-NR method. NRM item parameters for 
four 4-option items were randomly generated and transformed to item parameter "estimates" by 
applying the reparameterized forms of (4) and (5) (i.e., $a'_{ij} = a_{ij}/\alpha$ and 
$c'_{ij} = c_{ij} - \kappa a'_{ij}$), where α = 0.4 and κ = 1.3. The estimates were then 
transformed back to the parameter metric by application of the EQ-NR method; α = (2.5, 2.5, 
2.5, 2.5) and κ = (-3.25, -3.25, -3.25, -3.25). As can be seen, the equated item parameter 
estimates are equal to the parameters. (The application of the LH method to ordered polytomous 
models, such as the PCM and the GRM, is a direct extension of the binary case².) The major 
advantages of the EQ-NR method are its simplicity and that no special software is necessary for 
its implementation. However, its robustness in real-world applications needs to be 
investigated. 
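
The round trip described for Table 1 can be reproduced with a small sketch. It assumes the 
category-wise constants are the ratio of mean slopes (estimate metric over parameter metric) 
and the difference in mean intercepts divided by the mean parameter slope, and that equated 
values are obtained as slope/α and intercept minus equated slope times κ. The function names 
are hypothetical, and the two input rows are borrowed from the rows recoverable in Table 1 
purely as illustrative values. 

```python
import numpy as np


def eq_nr_constants(a_est, c_est, a_par, c_par):
    """Category-wise equating constants from common items.

    Inputs have shape (n_common_items, n_categories); means are taken
    across items separately for each category.
    """
    a_bar_est, c_bar_est = a_est.mean(axis=0), c_est.mean(axis=0)
    a_bar_par, c_bar_par = a_par.mean(axis=0), c_par.mean(axis=0)
    alpha = a_bar_est / a_bar_par                 # slope constant per category
    kappa = (c_bar_est - c_bar_par) / a_bar_par   # intercept constant per category
    return alpha, kappa


def eq_nr_apply(a_est, c_est, alpha, kappa):
    """Place estimates on the parameter (target) metric."""
    a_eq = a_est / alpha
    c_eq = c_est - a_eq * kappa
    return a_eq, c_eq


# Illustrative "parameters" for two 4-option items.
a_par = np.array([[-1.937, -0.702, 0.516, 2.123],
                  [-1.549, -1.361, 1.039, 1.870]])
c_par = np.array([[-2.040, -0.781, 1.333, 1.488],
                  [-1.615, -0.913, 0.719, 1.809]])

# Build "estimates" on a shifted metric with alpha = 0.4 and kappa = 1.3.
a_est = a_par / 0.4
c_est = c_par - 1.3 * a_est

alpha, kappa = eq_nr_constants(a_est, c_est, a_par, c_par)
a_eq, c_eq = eq_nr_apply(a_est, c_est, alpha, kappa)
print(np.round(alpha, 2), np.round(kappa, 2))
```

Because the constants divide by a mean slope for each category, a category whose mean slope is 
close to zero would make this sketch numerically unstable. 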



Insert Table 1 about here 



Analysis: The accuracy of item parameter estimation was assessed by root mean square error 
(RMSE). RMSE was calculated according to: 

$$\mathrm{RMSE}(A_{ij}) = \sqrt{\frac{\sum_{r=1}^{n_r} (\hat{A}_{ijr} - A_{ij})^2}{n_r}} \tag{12}$$

where $\hat{A}_{ijr}$ is the equated item parameter estimate (either $\hat{a}_{ij}$ or 
$\hat{c}_{ij}$) for item i option j in replication r, $A_{ij}$ is the corresponding item 
parameter (either $a_{ij}$ or $c_{ij}$), and $n_r$, the number of replications, equaled 25. 
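
As a minimal sketch of this accuracy measure, the function below computes the RMSE of a set of 
equated estimates of one parameter across replications. The function name and the three 
estimate values are hypothetical and serve only to show the arithmetic. 

```python
import numpy as np


def rmse(estimates, parameter):
    """Root mean square error of replicated estimates around the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - parameter) ** 2)))


# Hypothetical equated slope estimates from three replications; true slope 0.45.
value = rmse([0.46, 0.43, 0.47], 0.45)
print(round(value, 4))
```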

The analyses of the 3- and 4-category cases were treated separately, as were the slope and 
intercept parameters. The basic design was a two-group repeated measures design with LD and SSR 
as the between subjects factors and Imax as the within subjects factor. Because 
$\sum_j a_j = 0$ and $\sum_j c_j = 0$, a and c do not consist of $m_i$ independent item 
parameter estimates and the RMSE for each item option parameter estimate could not be used as 
the dependent variable. Therefore, the mean RMSE(A) across item options and across replicates 
was used as the dependent variable. 

It was expected that the accuracy of item parameter estimation would be related to the 
distribution of responses across item options. A measure of the distribution of responses across 
item options was obtained by using the index of dispersion, D: 

$$D = \frac{m_i\left(N_i^2 - \sum_{j=1}^{m_i} n_{ij}^2\right)}{N_i^2 (m_i - 1)} \tag{13}$$

where $N_i$ is the number of examinees responding to item i and $n_{ij}$ is the number of 
examinees responding in option j for item i. D has a range from 0.0 to 1.0 (inclusive) with 
D = 0.0 indicating that all responses to an item are in one option and D = 1.0 signifying that 
responses are evenly distributed across all options. 

RESULTS 

Table 2 contains descriptive statistics on the latent ability distributions for each SSR as well 
as the mean correlation between the item parameter and its estimate (i.e., the average 
correlation between the option parameter and its estimate across the number of item options; 
the correlations were converted to zs before averaging). As can be seen, for a given LD, 
increasing the SSR was associated with an increase in $r_{a\hat{a}}$ regardless of the number of 
item options. Similarly, increasing the SSR produced an increase in $r_{c\hat{c}}$; however, 
these increases were not as dramatic due to the strong linear relationship between c and 
$\hat{c}$ at the 5 : 1 SSR. For a given LD and SSR level the $r_{a\hat{a}}$s were consistently 
larger for the three-option condition than for the four-option condition. For a given SSR 
condition the correlations were largest for the uniform θ distributions and smallest for the 
positively skewed θ distributions, regardless of the number of item options. 

Insert Table 2 about here 

Figure 2 contains plots of D versus an item's average RMSE(a) for the 5 : 1 and 20 : 1 SSRs for 
the three- and four-option item sets; the 10 : 1 SSR plot falls predictably between the 5 : 1 
and 20 : 1 SSR plots. As can be seen, there is an inverse relationship between D and the mean 
RMSE(a): the average RMSE for an item decreased as the dispersion of responses across an item's 
options increased. Specifically, for the three-option item set the correlations between D and 
the mean RMSE(a) were -0.597, -0.647, and -0.647 for the 5 : 1, 10 : 1, and 20 : 1 SSRs, and the 
corresponding correlations for the four-option items were -0.348, -0.438, and -0.485. For the 
intercepts the correlations for the 5 : 1, 10 : 1, and 20 : 1 SSRs for the three-option items 
were -0.647, -0.569, and -0.512 and for the four-option items -0.331, -0.324, and -0.303, 
respectively. In general, the lowest RMSEs and larger Ds were associated with the uniform θ 
distribution, whereas the highest RMSEs and smaller Ds occurred with the positively skewed θ 
distribution, regardless of SSR. 

Moreover, for a given SSR level the mean D was less for the three-option item set 
($\bar{D}_{5:1}$ = 0.863, $\bar{D}_{10:1}$ = 0.869, $\bar{D}_{20:1}$ = 0.864) than for the 
four-option item set ($\bar{D}_{5:1}$ = 0.927, $\bar{D}_{10:1}$ = 0.928, $\bar{D}_{20:1}$ = 
0.928). The uniform θ distribution resulted in the greatest dispersion of responses across item 
options ($\bar{D}_{4\text{-option}}$ = 0.940, $\bar{D}_{3\text{-option}}$ = 0.902), with the 
normal and positively skewed θ distributions having approximately the same average D values 
(normal: $\bar{D}_{4\text{-option}}$ = 0.922, $\bar{D}_{3\text{-option}}$ = 0.848; positively 
skewed: $\bar{D}_{4\text{-option}}$ = 0.920, $\bar{D}_{3\text{-option}}$ = 0.840). 
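
To make the scale of these D values concrete, the index of dispersion can be sketched as below. 
The sketch assumes the standard form D = m(N² - Σn_j²) / (N²(m - 1)), which matches the 
stated range (0.0 when all responses fall in one option, 1.0 when they are spread evenly); the 
function name and the response counts are hypothetical. 

```python
def index_of_dispersion(counts):
    """Index of dispersion D for one item's option response counts."""
    m = len(counts)          # number of options
    n = sum(counts)          # number of examinees responding to the item
    return m * (n ** 2 - sum(x * x for x in counts)) / (n ** 2 * (m - 1))


print(index_of_dispersion([10, 0, 0]))      # all responses in one option
print(index_of_dispersion([5, 5, 5, 5]))    # evenly spread responses
print(index_of_dispersion([50, 30, 20]))    # hypothetical three-option item
```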

Insert Figure 2 about here 



The repeated measures analysis of the slope parameter (4-option items) is presented in Table 
3. As can be seen, the accuracy of estimating the slope parameters was influenced by the 
interaction of the LD with the SSR and the Imax. Post hoc comparisons for the Imax factor showed 
that the slope parameters for items with Imax = 0.16 or Imax = 0.09 (mean RMSE(a) = 0.060 
and mean RMSE(a) = 0.057, respectively) were estimated significantly more accurately than for 
items with Imax = 0.25 (mean RMSE(a) = 0.071). 

Analysis of the LD × SSR interaction showed that the average RMSE(a) for the uniform θ 
distribution was significantly less than that for either the normal or positively skewed 
distributions for all levels of the SSR factor and that the normal distribution mean RMSE(a) was 
significantly less than that of the positively skewed ability distribution for the 5 : 1, 
10 : 1, and 20 : 1 SSRs. Moreover, doubling the SSR led to significant reductions in the average 
RMSE(a) for the normal and the positively skewed θ distributions. Roughly speaking, quadrupling 
the sample size led to a halving of the average RMSE(a) for the 5 : 1 ratio. However, despite 
the increase in examinees for the positively skewed θ distribution the accuracy of estimation 
using the 20 : 1 ratio (mean RMSE(a) = 0.0636) only approximated that for the normal θ 
distribution using a 10 : 1 ratio (mean RMSE(a) = 0.0629). In addition, it took a 20 : 1 ratio 
with the normal θ distribution to produce an average RMSE(a) (mean RMSE(a) = 0.0430) approaching 
that for a uniform θ distribution based on a 5 : 1 SSR (mean RMSE(a) = 0.0427). 

Insert Table 3 about here 



Figure 3 contains the mean RMSE(a) for the SSR × LD interaction for the slope parameters for 
the four-option item sets. As can be seen, when θ is positively skewed twice as many subjects 
are needed in order to estimate the slope parameters approximately as accurately as when θ is 
normally distributed. For example, the mean RMSE(a) for the positively skewed LD condition using 
a 20 : 1 SSR is comparable to that with a normal distribution and a 10 : 1 SSR. Similarly, with 
the positively skewed LD condition a 10 : 1 SSR results in a mean RMSE(a) that is slightly 
better than that obtained using half as many subjects from a normal distribution. With a uniform 
distribution of ability even a 5 : 1 SSR provides more accurate estimation than can be obtained 
with four times as many subjects from a positively skewed ability distribution and is almost 
comparable to that obtained when ability is normally distributed. 



The analysis of the intercept parameters (4-option items) showed significant main effects for 
both the Imax and SSR factors (Table 4). Post hoc analyses showed that doubling the SSR did not 
lead to a significant reduction in the mean RMSE(c). However, increasing the SSR from 5 : 1 to 
20 : 1 led to almost halving the average RMSE(c) (the mean RMSE(c)s for the 5 : 1, 10 : 1, and 
20 : 1 levels were 0.085, 0.061, and 0.044, respectively). Similar to the case with RMSE(a), 
increasing Imax levels were associated with increases in the mean RMSE(c). As the item 
information increased from 0.09 to 0.16 to 0.25, there were significant decreases in the 
accuracy with which the intercept parameters were estimated (for Imax = 0.09: mean RMSE(c) = 
0.049; for Imax = 0.16: mean RMSE(c) = 0.059; and for Imax = 0.25: mean RMSE(c) = 0.081). 



Tables 5 and 6 contain the repeated measures analyses for the slope and intercept 
parameters for the three-option item set, respectively. Analysis of the significant SSR main 
effect for the slope parameter showed that doubling the SSR from 5 : 1 to 10 : 1 led to a 
significant reduction in the mean RMSE(a) (0.086 and 0.062, respectively); however, no 
significant improvement was realized by doubling the 10 : 1 SSR (the mean RMSE(a) for the 
20 : 1 level was 0.047). Quadrupling the 5 : 1 SSR also led to significantly more accurate slope 
parameter estimates, on average. However, given the above results it would appear to be 
unnecessary to use an SSR greater than 10 : 1 with 3-option items. The significant Imax × LD 
interaction showed that, regardless of Imax level, the uniform LD resulted, on average, in the 
lowest RMSE(a) and the positively skewed LD in the highest (Figure 4). In general, the slope 
parameters for the Imax = 0.25 level items were significantly more poorly estimated than for 
the Imax = 0.16 level items for all LDs and, except for the normal LD level, the average 
RMSE(a)s for the Imax = 0.16 level items were significantly greater than for the Imax = 0.09 
level items. 



Insert Figure 3 about here 



Insert Table 4 about here 



Insert Table 5 and Figure 4 about here 

Analysis of the RMSE(c) for the three-option items revealed results that paralleled those for 
the four-option items. Specifically, increasing the SSR from 5 : 1 to 20 : 1 led to a significant 
reduction in the average RMSE(c); the mean RMSE(c)s for the 5 : 1, 10 : 1, and 20 : 1 SSR levels 
were 0.088, 0.062, and 0.045, respectively. As was the case with the four-option items, more 
informative item sets were not as well estimated, on average, as the less informative item sets; 
mean RMSE(c)s were 0.048, 0.063, and 0.083 for the 0.09, 0.16, and 0.25 Imax levels, 
respectively. The mean RMSE(a) and RMSE(c) for the three-option items were comparable in 
magnitude to those of the four-option item set. 

Insert Table 6 about here 



DISCUSSION 

The use of marginal maximum likelihood estimation allows one to obtain item parameter 
estimates prior to estimating the examinees' θs. Obtaining the θ̂s may be performed using 
maximum likelihood, expected a posteriori (EAP), or maximum a posteriori estimation techniques 
and treating the item parameter estimates as known quantities. As such, the effect of the SSR 
and the LD on the θ̂s will be indirect (if at all) and only through their effect on the accuracy 
of estimating the item parameters. For this reason this study focused only on the accuracy of 
estimation of the NRM's item parameters. 

Results showed that as the latent θ distribution departs from a uniform distribution the 
accuracy of estimating the slope parameter decreases. In these cases, in order to increase the 
accuracy of estimating the slope parameter one needs to increase the sample size. The effects of 
the form of the θ distribution on RMSE may, in part, be related to the distribution of responses 
across item options. It was found that the uniform LD produced the greatest dispersal of 
responses across item options and that the positively skewed LD produced the least variability 
in the examinees' responses. Therefore, if there are insufficient numbers of examinees 
responding to a particular item option, then that option will not be as accurately estimated as 
other options that have attracted more examinee responses. (It should be noted that poor 
estimation of an option's parameters may affect the estimation of the other options' 
parameters.) Short of rewriting the option, increasing the sample size is one means of 
increasing the number of examinees responding to a particularly unattractive option. Moreover, 
more informative items (i.e., items with larger slope parameters) tended not to be as well 
estimated as less informative items. However, the RMSE(a) observed for these informative items 
(e.g., Imax = 0.25) may be considered adequate by some. Similar results were found with the 
intercept parameter. In particular, the more informative the items, the greater the number of 
subjects required in order to estimate the intercepts with a degree of accuracy comparable to 
that of less informative items. This was true for both three- and four-option items. 

Given the magnitude of the average RMSEs observed, were the significantly more accurate item 
parameter estimates obtained by increasing the SSR meaningfully more accurate? In a real world 
application there could be substantial costs involved in doubling or quadrupling the SSR (if it 
could be done at all). To answer this question an additional set of analyses based on confidence 
intervals (CIs) was performed. 

For each of the original six item parameter pools, a data set was generated according to the 
NRM that contained the responses of 1100 simulees. These simulees were distributed such that 
100 simulees were located at each of 11 θ points between -2.5 and 2.5 in 0.5 logit increments 
(i.e., 100 simulees had θ = -2.5, 100 simulees had θ = -2.0, ..., 100 simulees had θ = 2.5). 
For each of these 1100 simulees the EAP θ̂ and its standard error of estimation were obtained 
using the item parameter estimates from each replication as well as the item parameters used to 
generate the data; for EAP estimation 80 quadrature points and a uniform prior were used. 
Because there were 25 replications for each condition there were 25 θ̂s for each simulee and for 
a given condition there was a total of 1100 × 25 = 27,500 θ̂s. For each of these θ̂s a 95% CI was 
calculated and for a given condition the number of times the CI contained θ was recorded. Table 
7 contains the results of these analyses. 
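
The EAP and confidence interval steps can be sketched as follows. This is an illustration 
rather than the study's procedure: the quadrature range of [-4, 4], the use of the posterior 
standard deviation as the standard error, the 1.96 multiplier, the function names, and the 
repeated three-category item are all assumptions made for the example. 

```python
import numpy as np


def nrm_probs(theta, a, c):
    """NRM category probabilities, one row per theta value."""
    z = np.outer(np.atleast_1d(theta), a) + np.asarray(c, dtype=float)
    z -= z.max(axis=1, keepdims=True)
    ez = np.exp(z)
    return ez / ez.sum(axis=1, keepdims=True)


def eap(responses, items, n_quad=80, lo=-4.0, hi=4.0):
    """EAP estimate and posterior SD for one response string.

    responses : 0-based category index chosen on each item
    items     : list of (a, c) vectors treated as known
    A uniform prior makes the posterior proportional to the likelihood.
    """
    nodes = np.linspace(lo, hi, n_quad)
    log_like = np.zeros(n_quad)
    for x, (a, c) in zip(responses, items):
        log_like += np.log(nrm_probs(nodes, a, c)[:, x])
    post = np.exp(log_like - log_like.max())
    post /= post.sum()
    mean = float(np.sum(nodes * post))
    sd = float(np.sqrt(np.sum((nodes - mean) ** 2 * post)))
    return mean, sd


def ci_contains(theta_true, mean, sd, z=1.96):
    """True when the 95% interval around the estimate covers the true theta."""
    return mean - z * sd <= theta_true <= mean + z * sd


items = [([-0.75, -0.25, 1.0], [-1.5, -0.25, 1.75])] * 28  # illustrative item set
low_mean, low_sd = eap([0] * 28, items)    # always the first category
high_mean, high_sd = eap([2] * 28, items)  # always the third category
print(round(low_mean, 2), round(high_mean, 2))
```

Averaging `ci_contains` over all simulees and replications in a condition would give the 
proportion of intervals containing θ that is tabulated for that condition. 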

Insert Table 7 about here 



As can be seen from the top half of Table 7, while there were differences in the proportion of 
95% CIs containing θ across Imax levels, for a given LD and Imax condition increasing the SSR 
did not appear to result in meaningful differences in the proportion of CIs containing θ. In 
general, the entries approximated the expected value of 0.950. Alternatively, the CIs based on 
the item parameters give an indication of how well one could expect to do given the sample size 
used. The differences between the CIs calculated on the basis of the item parameters and their 
estimates are presented in the bottom half of Table 7. These differences are typically on the 
order of one one-thousandth. Overall, the largest differences are found for the 5 : 1 SSRs. 
However, these are small differences. In this regard, it appears that if one's focus is to use 
item parameters for ability estimation, a 5 : 1 SSR may produce item parameter estimates that 
are reasonably accurate. 

This CI approach (the top half of Table 7) may be used to compare different sets of item
parameter estimates for meaningful differences with respect to θ̂s. (If competing models are to be
compared, then a model-independent simulation data set would be used.) The CI method has the
advantages of simplicity, a clearly defined and objective goal, an indication of how well or poorly
one is doing in ability estimation, and, if desired, the possibility of significance testing.
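One way such a significance test might look is a one-sample z test of an observed coverage proportion against the nominal 0.95. The sketch below is an illustration only: the text does not say which test is intended, and treating the 27,500 intervals in a condition as independent Bernoulli trials is an assumption made here (the 25 θ̂s for a simulee share the same responses, so the resulting p-value is likely too small). The function name `coverage_z` is hypothetical.

```python
import math

def coverage_z(p_obs, n, p_nominal=0.95):
    """z statistic and two-sided p-value for an observed coverage proportion.

    Assumes n independent intervals, which is a simplification.
    """
    se = math.sqrt(p_nominal * (1.0 - p_nominal) / n)
    z = (p_obs - p_nominal) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided

# 0.943 is one of the coverage proportions reported in Table 7; n = 1100 x 25.
z, p = coverage_z(0.943, 27500)
print(round(z, 3), p)
```

With n this large even a difference of a few thousandths is statistically detectable, which is consistent with the distinction drawn above between significant and meaningful differences.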
^According to the documentation which accompanies EQUATE 2.0 (Baker, 1993a), the "nominally
scored test equating is quite sensitive to the values of the initial estimators [a and k]". Moreover,
"the interaction among sample size, the estimation techniques employed in MULTILOG, and the
equating coefficients yielded by EQUATE are in need of further investigation" (Baker, 1993b, p.
248).

^The application of the EQ-NR approach to ordered polytomous models (EQ-OR) is done
threshold-wise for obtaining the ks for the thresholds ($k_j = \bar{b}^*_j - \alpha \bar{b}_j$) and item-wise for
obtaining $\alpha$ ($\alpha = \bar{a} / \bar{a}^*$). The transformation of the item discrimination and the thresholds is
performed by $a^*_i = a_i / \alpha$ and $b^*_{ij} = \alpha b_{ij} + k_j$.







Table 1: Example equating using EQ-NR method.

                                              Parameter
            Item       a1       a2       a3       a4       c1       c2       c3       c4
            1      -1.937   -0.702    0.516    2.123   -2.040   -0.781    1.333    1.488
            2      -1.549   -1.361    1.039    1.870   -1.615   -0.913    0.719    1.809
            3      -2.126   -0.829    1.216    1.739   -2.434   -1.185    0.949    2.670
            4      -2.326   -1.155    1.131    2.350   -1.527   -1.379    0.527    2.378
            Mean   -1.984   -1.012    0.976    2.020   -1.904   -1.065    0.882    2.087

Estimates   1      -4.842   -1.755    1.289    5.308    4.255    1.500   -0.343   -5.412
            2      -3.871   -3.402    2.598    4.675    3.418    3.509   -2.658   -4.269
            3      -5.314   -2.073    3.040    4.347    4.475    1.510   -3.004   -2.981
            4      -5.815   -2.887    2.828    5.874    6.033    2.375   -3.149   -5.258
            Mean   -4.961   -2.529    2.439    5.051    4.545    2.223   -2.289   -4.480

Equated     1      -1.937   -0.702    0.516    2.123   -2.040   -0.781    1.333    1.488
            2      -1.549   -1.361    1.039    1.870   -1.615   -0.913    0.719    1.809
            3      -2.126   -0.829    1.216    1.739   -2.434   -1.185    0.949    2.670
            4      -2.326   -1.155    1.131    2.350   -1.527   -1.379    0.527    2.378
            Mean   -1.984   -1.012    0.976    2.020   -1.904   -1.065    0.882    2.087
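The relation between the Estimates and Equated blocks of Table 1 can be checked numerically. The sketch below is illustrative and is not the EQ-NR algorithm itself: it only applies a linear change of metric θ* = Aθ + K to the NRM parameters, for which invariance of aθ + c requires a* = a / A and c* = c - a*K. The constants A = 2.5 and K = -3.25 are not reported in this section; they are values chosen here because they reproduce the Equated block to within rounding. The function name `to_target_metric` is hypothetical.

```python
import numpy as np

# "Estimates" block of Table 1 (items 1-4).
a_est = np.array([[-4.842, -1.755, 1.289, 5.308],
                  [-3.871, -3.402, 2.598, 4.675],
                  [-5.314, -2.073, 3.040, 4.347],
                  [-5.815, -2.887, 2.828, 5.874]])
c_est = np.array([[4.255, 1.500, -0.343, -5.412],
                  [3.418, 3.509, -2.658, -4.269],
                  [4.475, 1.510, -3.004, -2.981],
                  [6.033, 2.375, -3.149, -5.258]])

# First block of Table 1, which the "Equated" block reproduces.
a_par = np.array([[-1.937, -0.702, 0.516, 2.123],
                  [-1.549, -1.361, 1.039, 1.870],
                  [-2.126, -0.829, 1.216, 1.739],
                  [-2.326, -1.155, 1.131, 2.350]])
c_par = np.array([[-2.040, -0.781, 1.333, 1.488],
                  [-1.615, -0.913, 0.719, 1.809],
                  [-2.434, -1.185, 0.949, 2.670],
                  [-1.527, -1.379, 0.527, 2.378]])

def to_target_metric(a, c, A, K):
    """Re-express NRM slopes and intercepts after theta* = A * theta + K."""
    a_star = a / A
    c_star = c - a_star * K
    return a_star, c_star

A, K = 2.5, -3.25   # illustrative constants, not taken from the text
a_eq, c_eq = to_target_metric(a_est, c_est, A, K)

# Largest absolute discrepancies from the tabled values.
print(np.abs(a_eq - a_par).max(), np.abs(c_eq - c_par).max())
```

In practice the equating coefficients would have to be estimated from the two sets of parameter values rather than supplied by hand.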








Table 2: Descriptive Statistics on Ability Distributions and Item Parameters^a

                                                   3-option items^b      4-option items^b
Distribution   SSR       Mean θ    SD θ    Skew θ   r(a,â)    r(c,ĉ)     r(a,â)    r(c,ĉ)
Normal         5 : 1      0.001   1.001    -0.024    0.668     0.991      0.621     0.988
               10 : 1    -0.003   0.999    -0.005    0.779     0.995      0.735     0.995
               20 : 1    -0.002   0.998    -0.011    0.870     0.998      0.844     0.997
PS             5 : 1     -0.001   0.731     1.291    0.529     0.991      0.500     0.989
               10 : 1    -0.003   0.733     1.332    0.652     0.996      0.637     0.995
               20 : 1     0.002   0.737     1.297    0.733     0.998      0.733     0.997
Uniform        5 : 1     -0.024   2.308     0.011    0.862     0.989      0.845     0.987
               10 : 1     0.002   2.315     0.004    0.920     0.994      0.897     0.993
               20 : 1    -0.005   2.309     0.003    0.946     0.997      0.929     0.996

^a SD: Standard Deviation, PS: Positively Skewed

^b correlations converted to z-scores before taking the average
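Averaging correlations on a z scale can be sketched as below. This assumes that "z-scores" in the table note refers to Fisher's r-to-z transformation (the inverse hyperbolic tangent) and that the mean z is transformed back to r for reporting; neither detail is stated in the note. The input values are made up for illustration.

```python
import numpy as np

def mean_correlation(rs):
    """Average correlations via Fisher's z: r -> arctanh(r), mean, then tanh."""
    z = np.arctanh(np.asarray(rs, dtype=float))
    return float(np.tanh(z.mean()))

# Hypothetical correlations from three replications.
print(mean_correlation([0.60, 0.70, 0.80]))
```

For positive correlations this average is never below the plain arithmetic mean, and the gap widens as the individual correlations approach 1.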
Table 3: RMSE Repeated Measures Analyses for slope parameters (4 options)^a.

Source                           SS     df      MS         F       p
Between Subjects
  LD                          0.099      2   0.049    70.389   0.000
  SSR                         0.050      2   0.025    35.836   0.000
  LD X SSR                    0.010      4   0.002     3.554   0.012
  Subj w/i Groups             0.038     54   0.001
Within Subjects
  Imax                        0.006      2   0.003    38.419   0.000
  Imax X LD                   0.001      4   0.000     1.580   0.185
  Imax X SSR                  0.000      4   0.000     0.442   0.778
  LD X SSR X Imax             0.000      8   0.000     0.355   0.942
  Imax X Subj w/i Groups      0.009    108   0.000

Post Hoc Comparison ts for LD:

                               SSR
Hypothesis           5 : 1    10 : 1    20 : 1
μnml vs μps         4.240*    2.588*    2.522*
μnml vs μunif       5.450*    3.682*    2.043*
μps vs μunif        9.690*    6.269*    4.565*

Post Hoc Comparison ts for Imax:

Hypothesis
μ0.09 vs μ0.16       2.391
μ0.09 vs μ0.25      11.730*
μ0.16 vs μ0.25       9.339*

Post Hoc Comparison ts for SSR:

                               LD
Hypothesis          Normal        PS      Unif
μ5:1 vs μ10:1       2.978*    4.630*     1.210
μ5:1 vs μ20:1       5.415*    7.133*    2.008*
μ10:1 vs μ20:1      2.437*    2.502*     0.798

^a nml: Normal, ps: Positively Skewed, unif: Uniform







Table 4: RMSE Repeated Measures Analyses for intercept parameters (4 options)^a.

Source                           SS     df      MS         F       p
Between Subjects
  LD                          0.005      2   0.002     1.692   0.194
  SSR                         0.053      2   0.026    18.492   0.000
  LD X SSR                    0.001      4   0.000     0.164   0.956
  Subj w/i Groups             0.077     54   0.001
Within Subjects
  Imax                        0.033      2   0.016    66.624   0.000
  Imax X LD                   0.001      4   0.000     1.524   0.200
  Imax X SSR                  0.002      4   0.000     1.783   0.138
  LD X SSR X Imax             0.001      8   0.000     0.386   0.926
  Imax X Subj w/i Groups      0.026    108   0.000

Post Hoc Comparison ts for SSR:

Hypothesis
μ5:1 vs μ10:1       -2.966
μ5:1 vs μ20:1       -4.932*
μ10:1 vs μ20:1      -1.966

Post Hoc Comparison ts for Imax:

Hypothesis
μ0.09 vs μ0.16       4.891*
μ0.09 vs μ0.25      15.934*
μ0.16 vs μ0.25      11.042*

^a nml: Normal, ps: Positively Skewed, unif: Uniform
Table 5: RMSE Repeated Measures Analyses for slope parameters (3 options)^a.

Source                           SS     df      MS         F       p
Between Subjects
  LD                          0.119      2   0.060    66.458   0.000
  SSR                         0.051      2   0.025    28.271   0.000
  LD X SSR                    0.007      4   0.002     1.889   0.126
  Subj w/i Groups             0.048     54   0.001
Within Subjects
  Imax                        0.013      2   0.006    75.923   0.000
  Imax X LD                   0.001      4   0.001     3.696   0.007
  Imax X SSR                  0.000      4   0.000     0.449   0.773
  LD X SSR X Imax             0.000      8   0.000     0.299   0.965
  Imax X Subj w/i Groups      0.009    108   0.000

Post Hoc Comparison ts for SSR:

Hypothesis
μ5:1 vs μ10:1       -3.807*
μ5:1 vs μ20:1       -6.075*
μ10:1 vs μ20:1      -2.268

Post Hoc Comparison ts for LD:

                               Imax
Hypothesis            0.09      0.16      0.25
μnml vs μps         4.272*    5.023*    6.737*
μnml vs μunif       5.740*    5.061*    4.895*
μps vs μunif       10.012*   10.084*   11.632*

Post Hoc Comparison ts for Imax:

                               LD
Hypothesis          Normal        PS      Unif
μ0.09 vs μ0.16       1.344    2.886*    2.738*
μ0.09 vs μ0.25      4.721*    9.779*    6.455*
μ0.16 vs μ0.25      3.377*    6.893*    3.717*

^a nml: Normal, ps: Positively Skewed, unif: Uniform







Table 6: RMSE Repeated Measures Analyses for intercept parameters (3 options)^a.

Source                           SS     df      MS         F       p
Between Subjects
  LD                          0.007      2   0.003     2.278   0.112
  SSR                         0.059      2   0.030    19.720   0.000
  LD X SSR                    0.001      4   0.000     0.085   0.987
  Subj w/i Groups             0.081     54   0.001
Within Subjects
  Imax                        0.039      2   0.019    94.249   0.000
  Imax X LD                   0.001      4   0.000     1.755   0.143
  Imax X SSR                  0.002      4   0.000     2.109   0.085
  LD X SSR X Imax             0.000      8   0.000     0.245   0.981
  Imax X Subj w/i Groups      0.022    108   0.000

Post Hoc Comparison ts for SSR:

Hypothesis
μ5:1 vs μ10:1       -3.017
μ5:1 vs μ20:1       -5.099*
μ10:1 vs μ20:1      -2.083

Post Hoc Comparison ts for Imax:

Hypothesis
μ0.09 vs μ0.16       8.040*
μ0.09 vs μ0.25      19.326*
μ0.16 vs μ0.25      11.285*

^a nml: Normal, ps: Positively Skewed, unif: Uniform
Table 7: Proportion of times 95% confidence intervals contained θ.

                                    3-options                  4-options
                                      Imax                       Imax
Distribution   SSR            0.09     0.16     0.25     0.09     0.16     0.25
Normal         parameters    0.952    0.956    0.943    0.946    0.936    0.952
               5 : 1         0.950    0.956    0.941    0.943    0.931    0.952
               10 : 1        0.952    0.957    0.942    0.945    0.935    0.952
               20 : 1        0.952    0.956    0.944    0.944    0.935    0.952
PS             parameters    0.952    0.956    0.943    0.946    0.936    0.952
               5 : 1         0.944    0.952    0.938    0.936    0.933    0.946
               10 : 1        0.952    0.954    0.941    0.945    0.937    0.950
               20 : 1        0.953    0.955    0.942    0.948    0.937    0.951
Uniform        parameters    0.952    0.956    0.943    0.946    0.936    0.952
               5 : 1         0.951    0.957    0.939    0.941    0.929    0.947
               10 : 1        0.952    0.957    0.941    0.942    0.931    0.947
               20 : 1        0.952    0.958    0.941    0.941    0.931    0.947

Differences between CIs based on item parameter estimates and CIs based on item parameters

Normal         5 : 1        -0.002    0.000    0.002   -0.003   -0.005    0.000
               10 : 1        0.000    0.001    0.001   -0.001   -0.001    0.000
               20 : 1        0.000    0.000    0.001   -0.002   -0.001    0.000
PS             5 : 1        -0.008   -0.004    0.005   -0.010   -0.003   -0.006
               10 : 1        0.000   -0.002    0.002   -0.001    0.001   -0.002
               20 : 1        0.001   -0.001    0.001    0.002    0.001   -0.001
Uniform        5 : 1        -0.001    0.001    0.004   -0.005   -0.007   -0.005
               10 : 1        0.000    0.001    0.002   -0.004   -0.005   -0.005
               20 : 1        0.000    0.002    0.002   -0.005   -0.005   -0.005
Figure Captions

Figure 1. Example ORFs for a three-category item, a = (-0.75, -0.25, 1.0) and c = (-1.5, -0.25,
1.75).

Figure 2a. D vs the mean RMSE(a)/item for 5:1 SSR, 3-option items.

Figure 2b. D vs the mean RMSE(a)/item for 20:1 SSR, 3-option items.

Figure 2c. D vs the mean RMSE(a)/item for 5:1 SSR, 4-option items.

Figure 2d. D vs the mean RMSE(a)/item for 20:1 SSR, 4-option items.

Figure 3. Mean RMSE(a) for the SSR and LD interaction, 4-option items.

Figure 4. Mean RMSE(a) for the Imax by LD interaction.